Comments for MEDB 5501, Week 14

Topics to be covered

  • What you will learn
    • Interpretation of regression slope and intercept
    • Assumptions in linear regression
    • Calculation of sums of squares
    • Confidence intervals and hypothesis tests
    • Relationship to the correlation coefficient
    • Definition of residuals
    • Diagnostic plots
    • Interpretation with two independent variables

Bad joke, 1 of 4

Bad joke, 2 of 4

Bad joke, 3 of 4

Bad joke, 4 of 4

Algebra formula for a straight line

  • \(y=mx+b\)
  • \(m = \Delta y / \Delta x\)
  • m = slope
  • b = y-intercept

Linear regression interpretation of a straight line

  • The slope represents the estimated average change in Y when X increases by one unit.

  • The intercept represents the estimated average value of Y when X equals zero.

First regression example with interpretation

Output from SPSS

Scatterplot of fat percentage and abdomen circumference

Correlating fat percentage with abdomen circumference, 1 of 3

Correlating fat percentage with abdomen circumference, 2 of 3

Correlating fat percentage with abdomen circumference, 3 of 3

R Square measure in linear regression

ANOVA table in linear regression

Regression coefficients

Predicting fat percentage from abdomen circumference (8/10)

QQ plot of residuals

Scatterplot of residuals

Break #1

  • What you have learned
    • Interpretation of regression slope and intercept
  • What’s coming next
    • Assumptions in linear regression

The population model

  • \(Y_i=\beta_0+\beta_1 X_i + \epsilon_i,\ i=1,...,N\)
    • \(\epsilon_i\) is an unknown random variable
      • Mean 0, standard deviation \(\sigma\)
      • Often assumed to be normal
    • \(\beta_0\) and \(\beta_1\) are unknown parameters
    • \(b_0\) and \(b_1\) are estimates from the sample
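
A small simulation makes the population/sample distinction concrete. This is a stdlib-only sketch; the parameter values \(\beta_0=3\), \(\beta_1=2\), \(\sigma=1\) are arbitrary choices for illustration, not values from the course data:

```python
import random
import statistics

random.seed(1)

# Population parameters (arbitrary choices for illustration)
beta0, beta1, sigma = 3.0, 2.0, 1.0

# Draw a sample from Y_i = beta0 + beta1*X_i + eps_i, eps_i ~ N(0, sigma)
n = 1000
x = [random.uniform(0, 10) for _ in range(n)]
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Least-squares estimates b0 and b1 computed from the sample
xbar, ybar = statistics.mean(x), statistics.mean(y)
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
```

With this many observations, \(b_0\) and \(b_1\) land close to \(\beta_0\) and \(\beta_1\); the gap shrinks as the sample grows.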

Least squares principle (1/3)

Least squares principle (2/3)

Least squares principle (3/3)

Violations of this model

  • Nonlinearity
  • Heterogeneity
  • Non-normality
  • Lack of independence

Break #2

  • What you have learned
    • Assumptions in linear regression
  • What’s coming next
    • Calculation of sums of squares

Artificial data

  y  x
 11 13
 15  9
 19 15
 21  7
 25 11
 29  5
  • \(\bar{X} = 10\)
  • \(\bar{Y} = 20\)
  • SD(X) = 3.7
  • SD(Y) = 6.5
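
These summary values can be checked with a few lines of stdlib Python (the SDs match the slide after rounding to one decimal):

```python
import statistics

# The artificial data from the slide
y = [11, 15, 19, 21, 25, 29]
x = [13, 9, 15, 7, 11, 5]

xbar = statistics.mean(x)   # 10
ybar = statistics.mean(y)   # 20
sd_x = statistics.stdev(x)  # 3.7 after rounding (sample SD, n-1 divisor)
sd_y = statistics.stdev(y)  # 6.5 after rounding
```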

Sum of squares regression

Sum of squares error

Sum of squares total / corrected total

Sum of squares total (uncorrected)

ANOVA table for linear regression

\[\begin{matrix} & SS & df & MS & F-ratio \\ Regression & SSR & 1 & MSR=\frac{SSR}{1} & F=\frac{MSR}{MSE} \\ Error & SSE & n-2 & MSE=\frac{SSE}{n-2} & \\ Total & SST & n-1 & & \end{matrix}\]
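
For the artificial data above, the table's entries can be reproduced directly from the definitions. This is a stdlib-only sketch; the variable names are mine:

```python
y = [11, 15, 19, 21, 25, 29]
x = [13, 9, 15, 7, 11, 5]
n = len(y)

# Least-squares fit
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# Sums of squares and the rest of the ANOVA table
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error
sst = sum((yi - ybar) ** 2 for yi in y)               # total; SST = SSR + SSE
msr = ssr / 1
mse = sse / (n - 2)
f_ratio = msr / mse
```

For these six points, SSR = 70, SSE = 144, SST = 214, and F = 70/36 ≈ 1.94.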

Review: ANOVA table for oneway ANOVA

\[\begin{matrix} & SS & df & MS & F-ratio \\ Between & SSB & k-1 & MSB=\frac{SSB}{k-1} & F=\frac{MSB}{MSW} \\ Within & SSW & n-k & MSW=\frac{SSW}{n-k} & \\ Total & SST & n-1 & & \end{matrix}\]

R-squared

  • SST, total variation, is split into
    • SSR, explained variation, and
    • SSE, unexplained variation
  • \(R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\)
    • \(0 \le R^2 \le 1\)
    • Proportion of explained variation

SPSS linear regression

  • Row 1: SSR, df, MSR, F-ratio, p-value
  • Row 2: SSE, df, MSE
  • Row 3: SST, df

SPSS General Linear Model

  • Row 1: SSR, df, MSR, F-ratio, p-value
  • Row 3: same as Row 1
  • Row 4: SSE, df, MSE
  • Row 6: SST, df

Break #3

  • What you have learned
    • Calculation of sums of squares
  • What’s coming next
    • Confidence intervals and hypothesis tests

Confidence interval

  • \(b_1 \pm t(1-\alpha/2, n-2) s.e.(b_1)\)
    • \(s.e.(b_1)=\sqrt{\frac{MSE}{\Sigma (X_i-\bar{X})^2}}\)
  • How to get a narrower confidence interval
    • Decrease the noise (MSE)
    • Increase the sample size
    • Increase the spread of the X’s
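
Plugging in the quantities from the artificial-data fit (\(b_1 = -1\), MSE = 36, \(\Sigma (X_i-\bar{X})^2 = 70\), n = 6) gives a concrete interval; the critical value \(t(0.975, 4) \approx 2.776\) comes from a t table:

```python
import math

b1, mse, sxx, n = -1.0, 36.0, 70.0, 6   # from the artificial-data fit

se_b1 = math.sqrt(mse / sxx)   # ~0.717
t_crit = 2.776                 # t(1 - alpha/2, n-2) = t(0.975, 4)
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1    # interval is roughly (-2.99, 0.99)
```

The interval includes 0, so these six points do not show a statistically significant slope.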

Hypothesis test

  • \(H_0:\ \beta_1=0\) vs. \(H_1:\ \beta_1 \ne 0\)
  • Compare \(T=\frac{b_1}{s.e.(b_1)}\) to \(t(1-\alpha/2; n-2)\)
    • Accept \(H_0\) if T is close to zero
    • Reject \(H_0\) if T is large negative or large positive

Equivalent hypothesis test

  • Compare \(F=\frac{MSR}{MSE}\) to \(F(1-\alpha; 1, n-2)\)
    • Accept \(H_0\) if F is close to one
    • Reject \(H_0\) if F is large positive
  • Note: \(F=T^2\)
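
The identity \(F=T^2\) can be checked numerically with the artificial-data quantities (MSR = 70, MSE = 36, \(\Sigma (X_i-\bar{X})^2 = 70\)):

```python
import math

b1, msr, mse, sxx = -1.0, 70.0, 36.0, 70.0   # from the artificial-data fit

t_stat = b1 / math.sqrt(mse / sxx)   # ~-1.39
f_stat = msr / mse                   # ~1.94, equal to t_stat squared
```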

Two more equivalent tests

  • Compute the p-value from T or F
    • Accept \(H_0\) if p-value \(> \alpha\)
    • Reject \(H_0\) if p-value \(\le \alpha\)
  • Compute the confidence interval (CI) for \(\beta_1\)
    • Accept \(H_0\) if CI includes 0
    • Reject \(H_0\) if CI does not include 0

Break #4

  • What you have learned
    • Confidence intervals and hypothesis tests
  • What’s coming next
    • Relationship to the correlation coefficient

Calculation of the regression slope and intercept

  • \(b_1=\frac{\Sigma (X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma (X_i-\bar{X})^2}\)
  • \(b_0=\bar{Y}-b_1\bar{X}\)

Relationship to the correlation coefficient

  • Recall from the previous module
    • \(Cov(X,Y)=\frac{1}{n-1}\Sigma(X_i-\bar{X})(Y_i-\bar{Y})\)
    • \(r_{XY}=\frac{Cov(X,Y)}{S_X S_Y}\)
  • This implies that
    • \(b_1=r_{XY}\frac{S_Y}{S_X}\)
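
For the artificial data, this identity can be verified with the covariance formula from the previous module (stdlib sketch):

```python
import statistics

y = [11, 15, 19, 21, 25, 29]
x = [13, 9, 15, 7, 11, 5]
n = len(x)

xbar, ybar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Pearson correlation via Cov(X, Y) / (S_X * S_Y)
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
r = cov / (s_x * s_y)   # ~-0.57

b1 = r * s_y / s_x      # recovers the least-squares slope, -1 here
```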

Important implications

  • \(r_{XY}\) is unitless, \(b_1\) is Y units per X units
  • \(r_{XY}>0\) implies \(b_1>0\)
  • \(r_{XY}=0\) implies \(b_1=0\)
  • \(r_{XY}<0\) implies \(b_1<0\)
    • and vice versa

Break #5

  • What you have learned
    • Relationship to the correlation coefficient
  • What’s coming next
    • Definition of residuals

Predicted values

  • For a new value of X
    • \(\hat{Y}_{new}=b_0+b_1 X_{new}\)
  • For an existing value in the data, \(X_i\)
    • \(\hat{Y}_i=b_0+b_1 X_i\)

Why predict for a value you already have seen?

  • Future Y may differ from previous Y
  • Comparison of \(\hat{Y}_i\) to existing \(Y_i\).

Residual

  • \(e_i=Y_i-\hat{Y}_i\)
    • Error in prediction
    • \(\Sigma e_i=0\)
    • Estimate of \(\epsilon_i\)
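
For the artificial data, the least-squares fit works out to \(\hat{Y} = 30 - X\), and the residuals behave as claimed:

```python
y = [11, 15, 19, 21, 25, 29]
x = [13, 9, 15, 7, 11, 5]
b0, b1 = 30.0, -1.0   # least-squares estimates for these six points

yhat = [b0 + b1 * xi for xi in x]              # predicted values
resid = [yi - yh for yi, yh in zip(y, yhat)]   # residuals e_i = Y_i - Yhat_i

residual_sum = sum(resid)   # 0, as least squares guarantees
```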

Break #6

  • What you have learned
    • Definition of residuals
  • What’s coming next
    • Diagnostic plots

Testing the various assumptions

  • Nonlinearity
    • Scatterplot of residuals vs. independent variable
  • Heterogeneity
    • Scatterplot of residuals vs. independent variable
  • Non-normality
    • Q-Q plot of residuals
  • Lack of independence
    • Usually assessed qualitatively
    • Durbin-Watson test for serial correlation
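
The plots themselves are graphical, but their ingredients can be sketched numerically. Least squares forces the residuals to be exactly uncorrelated with X, which is why the residual-vs-X plot is read for curvature or a fan shape rather than a linear trend (stdlib sketch using the artificial data):

```python
import statistics

x = [13, 9, 15, 7, 11, 5]
resid = [-6.0, -6.0, 4.0, -2.0, 6.0, 4.0]   # residuals from the fit yhat = 30 - x
n = len(resid)

# Cross-products of residuals with X sum to zero by construction,
# so any visible pattern in the scatterplot must be nonlinear
xbar = statistics.mean(x)
cross = sum((xi - xbar) * ei for xi, ei in zip(x, resid))

# Q-Q plot ingredients: sorted residuals vs. standard normal quantiles
nd = statistics.NormalDist()
theoretical = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
observed = sorted(resid)
```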

Break #7

  • What you have learned
    • Diagnostic plots
  • What’s coming next
    • Interpretation with two independent variables

Model

  • \(Y_i=\beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\epsilon_i\)
  • Least squares estimates: \(b_0,\ b_1,\ b_2\)

Interpretations

  • \(b_0\) is the estimated average value of Y when X1 and X2 both equal zero.
  • \(b_1\) is the estimated average change in Y
    • when \(X_1\) increases by one unit, and
    • \(X_2\) is held constant
  • \(b_2\) is the estimated average change in Y
    • when \(X_2\) increases by one unit, and
    • \(X_1\) is held constant
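
A tiny numeric sketch of the "held constant" interpretation; the coefficient values here are made up for illustration and are not the course's fitted FEV results:

```python
# Hypothetical fitted coefficients (illustrative only)
b0, b1, b2 = -4.5, 0.10, 0.06   # say, FEV from height (X1) and age (X2)

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# Raising X1 by one unit while X2 stays fixed shifts the prediction by b1
delta = predict(61, 10) - predict(60, 10)
```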

Unadjusted relationship between height and FEV

Relationship between height and FEV controlling at Age=3

Relationship between height and FEV controlling at Age=4

Relationship between height and FEV controlling at Age=5

Relationship between height and FEV controlling at Age=6

Relationship between height and FEV controlling at Age=7

Relationship between height and FEV controlling at Age=8

Relationship between height and FEV controlling at Age=9

Relationship between height and FEV controlling at Age=10

Relationship between height and FEV controlling at Age=11

Relationship between height and FEV controlling at Age=12

Relationship between height and FEV controlling at Age=13

Relationship between height and FEV controlling at Age=14

Relationship between height and FEV controlling at Age=15

Relationship between height and FEV controlling at Age=16

Relationship between height and FEV controlling at Age=17

Relationship between height and FEV controlling at Age=18

Relationship between height and FEV controlling at Age=19

Unadjusted relationship between age and FEV

Relationship between age and FEV controlling for height between 46 and 49.5

Relationship between age and FEV controlling for height between 50 and 53.5

Relationship between age and FEV controlling for height between 54 and 57.5

Relationship between age and FEV controlling for height between 58 and 61.5

Relationship between age and FEV controlling for height between 62 and 65.5

Relationship between age and FEV controlling for height between 66 and 69.5

Relationship between age and FEV controlling for height between 70 and 73.5

Summary

  • What you have learned
    • Interpretation of regression slope and intercept
    • Assumptions in linear regression
    • Calculation of sums of squares
    • Confidence intervals and hypothesis tests
    • Relationship to the correlation coefficient
    • Definition of residuals
    • Diagnostic plots
    • Interpretation with two independent variables

Additional topics?